feat: add operational counters for sampling, requirements, and tools (#467) #883
ajbozarth wants to merge 2 commits into generative-computing:main
Conversation
…enerative-computing#467) Adds six new OpenTelemetry counters giving operators visibility into retry behaviour, validation failure rates, and tool call health: mellea.sampling.attempts/successes/failures, mellea.requirement.checks/failures, and mellea.tool.calls. Follows the established lazy-init globals + record_* helpers + Plugin hooks pattern. Extends SamplingIterationPayload and SamplingLoopEndPayload with a strategy_name field so plugins can tag counters by strategy class. Assisted-by: Claude Code Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
The PR description has been updated. Please fill out the template for your PR to be reviewed.
PeriodicExportingMetricReader background threads (60 s default) would fire after pytest closed stdout once the suite crossed the 60 s mark, causing "I/O operation on closed file" and OTLP UNAVAILABLE errors. Assisted-by: Claude Code Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
This is parallel work to #882, which adds the other remaining metrics telemetry; whichever merges second will need a manual merge to resolve the overlapping git diffs.

This looks good to me as an implementation of issue #467. I do wonder how useful these counters are, though. They are grouped only by strategy and requirement class, which may not give much insight without some finer-grained groupings.
akihikokuroda left a comment
LGTM. I have a general comment about the issue that this PR implements.
@jakelorocco @psschwei I actually wondered about this while implementing. I thought the idea was interesting and valuable, but my work did not create an example of how it could be used (which maybe I should add?)
Off the top of my head, it could be useful to know if a particular requirement is hard to satisfy for a particular model (for example, maybe base granite doesn't do well on quantum and we need to substitute in the granite-qiskit model). I'm not sure we necessarily need an example though (I'm a little worried about having too many examples to maintain, xref #811). Not opposed either.
psschwei left a comment
Found one issue with metric cardinality — each unique combination of attribute values creates a separate time series in your metrics backend.
Step 1: A requirement validation fails. The ValidationResult object has a .reason field containing a human-readable explanation of why it failed.
Step 2: The new RequirementMetricsPlugin (added in this PR) fires on the validation_post_check hook. It grabs reason from the result:

```python
reason = getattr(result, "reason", None) or "unknown"
record_requirement_failure(req_name, reason)
```

Step 3: record_requirement_failure (also added in this PR) passes that reason string directly as a counter attribute:
```python
_get_requirement_failures_counter().add(
    1, {"requirement": requirement, "reason": reason}
)
```

Step 4: Here's the problem. For some requirement types — specifically LLM-as-a-Judge (mellea/core/requirement.py:187, pre-existing code) — the reason is set to the full LLM output text. That's a different string nearly every time. Each unique reason string creates a new time series in the metrics backend.
So if you run 1,000 LLM-as-a-Judge validations that fail, you get ~1,000 distinct time series for mellea.requirement.failures, each with a count of 1. Metrics backends like Prometheus have cardinality limits and will start dropping data or degrading performance.
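To make the blow-up concrete, here is a tiny stand-alone simulation (plain Python, no OpenTelemetry dependency): the backend is modeled as a dict keyed by the attribute combination, mirroring how Prometheus keys time series by label values.

```python
# Stand-in metrics backend: one entry per unique attribute combination.
series: dict[tuple, int] = {}

def add(counter_name: str, attributes: dict) -> None:
    key = (counter_name, tuple(sorted(attributes.items())))
    series[key] = series.get(key, 0) + 1

# 1,000 failed LLM-as-a-Judge validations, each with a unique free-form reason:
for i in range(1000):
    add("mellea.requirement.failures",
        {"requirement": "check_style", "reason": f"The output fails because... ({i})"})

print(len(series))  # 1,000 distinct time series, each with a count of 1
```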
What you'd want instead: Either drop reason from the metric attributes (put it in traces/logs instead), or map it to a small bounded set of categories like "llm_judge", "execution_error", "timeout", etc.
Type of PR: Misc

Description
Adds six new OpenTelemetry counters as part of the telemetry epic (#443),
giving operators visibility into retry behaviour, validation failure rates,
and tool call health alongside the existing LLM-level metrics:
- mellea.sampling.attempts / mellea.sampling.successes / mellea.sampling.failures — tagged by strategy class name
- mellea.requirement.checks / mellea.requirement.failures — tagged by requirement class name and failure reason
- mellea.tool.calls — tagged by tool name and status ("success"/"failure")

Follows the established plugin pattern: lazy-init counter globals + record_* helpers in metrics.py, consumed by three new Plugin classes in metrics_plugins.py (SamplingMetricsPlugin, RequirementMetricsPlugin, ToolMetricsPlugin) that hook into existing hook events in FIRE_AND_FORGET mode. No metric calls in business-logic files.
Also extends SamplingIterationPayload and SamplingLoopEndPayload with a strategy_name field (backward-compatible, defaults to "") so plugins can tag counters with the strategy class name.
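The backward-compatible field addition can be pictured as a dataclass field with a default; the class name comes from the PR, but the other fields shown are assumed for illustration:

```python
from dataclasses import dataclass

@dataclass
class SamplingIterationPayload:
    # Existing fields elided; only the shape of the change is shown.
    iteration: int = 0
    strategy_name: str = ""  # new field: old call sites omit it and still work

p = SamplingIterationPayload(iteration=3)
print(p.strategy_name)  # "" when callers predate the new field
```

Because the new field has a default, existing constructors and serialized payloads need no changes.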
Testing
Attribution